Exodus - Exploring SMT for EU Institutions

Authors

  • Michael Jellinghaus
  • Alexandros Poulis
  • David Kolovratník
Abstract

In this paper, we describe Exodus, a joint pilot project of the European Commission's Directorate-General for Translation (DGT) and the European Parliament's Directorate-General for Translation (DG TRAD) which explores the potential of deploying new approaches to machine translation in European institutions. We have participated in the English-to-French track of this year's WMT10 shared translation task using a system trained on data previously extracted from large in-house translation memories.

1 Project Background

1.1 Translation at EU Institutions

The European Union's policy on multilingualism[1] requires an enormous number of documents to be translated into the 23 official languages (which yield 23 × 22 = 506 translation directions). To cope with this task, the EU has the biggest translation service in the world, employing almost 5000 internal staff as translators (of whom 1750 work at the European Commission (EC) and 760 at the European Parliament (EP) alone), backed up by more than 2000 support staff. In 2009, the total output of the Commission's Directorate-General for Translation (DGT) and the Parliament's Directorate-General for Translation (DG TRAD) together was more than 3 million translated pages. Thus, it is not surprising that the cost of all translation and interpreting services of all the EU institutions amounts to 1% of the annual EU budget (2008 figures). By our estimate, this is more than €1 billion per year.

1.2 Machine Translation and Other Translation Technologies at EU Institutions

To make the translators' work more efficient, so that they can translate more pages in the same time, a number of tools are at their disposal: terminology databases, bilingual concordancers, and, most importantly, translation memories. Most of these tools are heavily used.
[1] http://ec.europa.eu/education/languages/eu-language-policy/index_en.htm

In real translation production scenarios, machine translation is usually used to complement translation memory (TM) tools. Translation memories are databases that contain text segments (usually sentences) stored together with their translations. Each such pair of source and target language segments is called a translation unit. Translation units also contain useful metadata (creation date, document type, client, etc.) that allow us to filter the data for both translation and machine translation purposes. A TM tool tries to match the segments of a document that needs to be translated with segments in the translation memory and proposes translations. If the memory contains an identical string, we have a so-called exact or 100% match, which yields a very reliable translation. Approximate or partial matches are called fuzzy matches, and the minimum acceptable fuzzy match score is usually set to 65%–70%. Lower matches are not considered usable since editing them takes more time than typing a translation from scratch. First experiments have shown that the quality of SMT output for certain language pairs is equal or similar to that of 70% fuzzy matches. Consequently, machine translation can play a helpful role in this context when, for a segment to be translated, there is no exact match and the available fuzzy matches do not exceed a certain threshold. In our case, this threshold is expected to be 85% or lower.

For this purpose, there exists a rule-based system called ECMT (European Commission Machine Translation), which is also accessible to other European institutions. However, ECMT covers only certain translation directions, and its maintenance is complicated and requires a considerable amount of dedicated, specialized human resources.
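The routing between translation memory and machine translation described above can be sketched as follows. This is only an illustration under stated assumptions, not the tooling used at the EU institutions: the function `propose`, the constants `FUZZY_FLOOR` and `MT_THRESHOLD`, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices made for this example. Production TM tools compute fuzzy match scores with their own, typically edit-distance-based, measures.

```python
from difflib import SequenceMatcher

# Assumed values taken from the ranges quoted in the text.
FUZZY_FLOOR = 0.70    # below this, a TM match is not considered usable (65%-70%)
MT_THRESHOLD = 0.85   # MT is considered when no match exceeds this ("85% or lower")


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real TM score."""
    return SequenceMatcher(None, a, b).ratio()


def propose(segment: str, memory: dict[str, str]):
    """Return (route, proposed_translation_or_None, best_score).

    `memory` maps source segments to their stored translations,
    i.e. a minimal model of translation units without metadata.
    """
    # Exact (100%) match: the stored translation is used directly.
    if segment in memory:
        return "exact", memory[segment], 1.0

    # Otherwise look for the best fuzzy match.
    best_src, best_score = None, 0.0
    for src in memory:
        score = similarity(segment, src)
        if score > best_score:
            best_src, best_score = src, score

    # A strong fuzzy match is preferred over machine translation.
    if best_score > MT_THRESHOLD:
        return "fuzzy", memory[best_src], best_score

    # No exact match and no fuzzy match above the threshold: MT candidate.
    # A usable fuzzy match (>= FUZZY_FLOOR) is still returned for comparison.
    fallback = memory[best_src] if best_score >= FUZZY_FLOOR else None
    return "mt", fallback, best_score
```

For instance, with a memory holding a single translation unit, an identical segment is routed to `"exact"`, a segment differing by one character to `"fuzzy"`, and an unrelated segment to `"mt"`.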
In light of these facts, and with the addition of the languages of (prospective) new member states, statistical approaches to machine translation seem to offer a viable alternative. SMT is data-driven, i.e. it exploits parallel corpora, of which there are plenty at the EU institutions in the form of translation memories. Translation memories have two main advantages over other parallel corpora. First, they contain almost exclusively perfectly aligned segments, as each segment is stored together with its translation, and secondly,





Publication date: 2010